Human pose estimation (i.e., locating the body parts/joints of a person) is a fundamental problem in human-computer interaction and multimedia applications. Significant progress has been made with the development of depth sensors, which enable human pose prediction from still depth images [32]. However, most existing approaches to this problem involve several components/models that are independently designed and optimized, leading to suboptimal performance. In this paper, we propose a novel inference-embedded multi-task learning framework for predicting human pose from still depth images, implemented with a deep neural network architecture. Specifically, we handle two cascaded tasks: i) generating the heat (confidence) maps of body parts via a fully convolutional network (FCN); ii) seeking the optimal configuration of body parts, based on the detected body part proposals, via an inference built-in MatchNet [10], which measures the appearance and geometric kinematic compatibility of body parts and embodies the dynamic programming inference as an extra network layer. These two tasks are jointly optimized. Our extensive experiments show that the proposed deep model significantly improves the accuracy of human pose estimation over several other state-of-the-art methods and SDKs. We also release a large-scale dataset for comparison, which includes 100K depth images captured under challenging scenarios.
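To make the two-stage idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): stage one yields per-part heat maps from which candidate locations are extracted; stage two runs max-sum dynamic programming along a kinematic chain, trading off each candidate's appearance (heat) score against a geometric compatibility penalty. The part offsets, weights, and function names are illustrative assumptions.

```python
import numpy as np

def top_k_proposals(heat_map, k=3):
    """Return the k highest-scoring (row, col, score) candidates of one heat map."""
    flat = np.argsort(heat_map.ravel())[::-1][:k]
    rows, cols = np.unravel_index(flat, heat_map.shape)
    return [(r, c, heat_map[r, c]) for r, c in zip(rows, cols)]

def chain_dp(proposals, expected_offsets, geom_weight=0.1):
    """Max-sum dynamic programming over a chain of body parts.

    proposals[i]        : candidate (row, col, score) list for part i
    expected_offsets[i] : assumed (drow, dcol) offset from part i to part i+1
    """
    n = len(proposals)
    # best[i][j] = best cumulative score ending at candidate j of part i
    best = [np.array([s for _, _, s in proposals[0]])]
    back = []
    for i in range(1, n):
        prev = best[-1]
        cur_scores, cur_back = [], []
        for r, c, s in proposals[i]:
            # geometric penalty: deviation from the expected kinematic offset
            pen = np.array([
                np.hypot(r - pr - expected_offsets[i - 1][0],
                         c - pc - expected_offsets[i - 1][1])
                for pr, pc, _ in proposals[i - 1]
            ])
            total = prev - geom_weight * pen
            cur_back.append(int(np.argmax(total)))
            cur_scores.append(s + total.max())
        best.append(np.array(cur_scores))
        back.append(cur_back)
    # backtrack the jointly optimal configuration
    idx = [int(np.argmax(best[-1]))]
    for i in range(n - 2, -1, -1):
        idx.append(back[i][idx[-1]])
    idx.reverse()
    return [proposals[i][j][:2] for i, j in enumerate(idx)]
```

In the paper this inference step is embodied as an extra network layer so it can be trained jointly with the heat-map FCN; the sketch above only shows the forward (inference) pass in isolation.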